Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Sep 15, 2025

Which issue does this PR close?

N/A

Rationale for this change

TPC-H q1 time improves from 12.55s to 11.57s (8% speedup).

What changes are included in this PR?

This PR fixes some old tech debt around the way we implemented COUNT aggregates. Originally, we tried using DataFusion's count aggregate but it was extremely slow when integrated into Comet. We did not find the root cause for this, but instead, we implemented COUNT(expr) as SUM(IF(expr IS NOT NULL, 1, 0)) and this provided better performance. However, this is not efficient since there are intermediate arrays being created from the IF expression. For COUNT(*), which Spark translates to COUNT(1), this was even more inefficient because 1is never null.

  • This PR implements specialized CountRows and CountNotNull aggregates that are faster and more memory efficient.
  • The previous SUM approach is still used for counts with multiple arguments, such as COUNT(expr1, expr2), which translates to SUM(IF(expr1 IS NOT NULL AND expr2 IS NOT NULL, 1, 0))
  • New CometFuzzTestBase extracted from CometFuzzTestSuite
  • New CometFuzzAggregateSuite

How are these changes tested?

New fuzz tests are added. In CometAggregateSuite, we did not have any tests for count(*) or for count with multiple expressions.

@codecov-commenter
Copy link

codecov-commenter commented Sep 15, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.51%. Comparing base (f09f8af) to head (aa6c421).
⚠️ Report is 511 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2397      +/-   ##
============================================
+ Coverage     56.12%   57.51%   +1.39%     
- Complexity      976     1295     +319     
============================================
  Files           119      147      +28     
  Lines         11743    13469    +1726     
  Branches       2251     2352     +101     
============================================
+ Hits           6591     7747    +1156     
- Misses         4012     4457     +445     
- Partials       1140     1265     +125     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- name: "fuzz"
value: |
org.apache.comet.CometFuzzTestSuite
org.apache.comet.CometFuzzAggregateSuite
Copy link
Member Author

@andygrove andygrove Sep 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to add this initially, and CI failed thanks to the recently added checks for this 😄

@andygrove andygrove marked this pull request as ready for review September 15, 2025 12:06
@mbutrovich
Copy link
Contributor

  • New CometFuzzTestBase extracted from CometFuzzTestSuite

Oh I love this.

This PR fixes some old tech debt around the way we implemented COUNT aggregates. Originally, we tried using DataFusion's count aggregate but it was extremely slow when integrated into Comet.

Did we try if performance improved at all recently? How long ago was this?

}

// Count all rows regardless of null values
let array = &values[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked ? we can avoid boundary check if we checked above the values is not empty

@andygrove
Copy link
Member Author

  • New CometFuzzTestBase extracted from CometFuzzTestSuite

Oh I love this.

This PR fixes some old tech debt around the way we implemented COUNT aggregates. Originally, we tried using DataFusion's count aggregate but it was extremely slow when integrated into Comet.

Did we try if performance improved at all recently? How long ago was this?

This was ~1 year ago. See #744 for more information. Count was ~10x slower than Sum at the time (but only when integrated with Comet).

@andygrove andygrove marked this pull request as draft September 16, 2025 14:54
@andygrove
Copy link
Member Author

  • New CometFuzzTestBase extracted from CometFuzzTestSuite

Oh I love this.

This PR fixes some old tech debt around the way we implemented COUNT aggregates. Originally, we tried using DataFusion's count aggregate but it was extremely slow when integrated into Comet.

Did we try if performance improved at all recently? How long ago was this?

This was ~1 year ago. See #744 for more information. Count was ~10x slower than Sum at the time (but only when integrated with Comet).

I've moved this PR to draft for now. I will split out the testing changes into a separate PR and then experiment again with using DataFusion's count.

@andygrove
Copy link
Member Author

Closing this since #2407 is less maintenance

@andygrove andygrove closed this Sep 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants